Abstract

We study the performance of an adaptive procedure based on a convex
combination, with data-driven weights, of term-by-term thresholded wavelet
estimators. For the bounded regression model, with random uniform design,
and the nonparametric density model, we show that the resulting estimator is
optimal in the minimax sense over all Besov balls under the $L^2$ risk, without
any logarithmic factor.

1 Introduction 

Wavelet shrinkage methods have been very successful in nonparametric function
estimation. They provide estimators that are spatially adaptive and (near) optimal over
a wide range of function classes. Standard approaches are based on term-by-term
thresholding. A well-known example is the hard thresholded estimator introduced by
[21]. If we observe $n$ statistical data and if the unknown function $f$ has an expansion
of the form $f = \sum_j \sum_k \beta_{j,k}\psi_{j,k}$, where $\{\psi_{j,k}, j, k\}$ is a wavelet basis
and $(\beta_{j,k})_{j,k}$ are the associated wavelet coefficients, then the term-by-term wavelet
thresholded method consists of three steps:

1. a linear step corresponding to the estimation of the coefficients $\beta_{j,k}$ by some
estimators $\hat\beta_{j,k}$ constructed from the data,

2. a non-linear step consisting in a thresholded procedure
$T_\lambda(\hat\beta_{j,k}) 1_{\{|\hat\beta_{j,k}| \ge \lambda_j\}}$, where
$\lambda = (\lambda_j)_j$ is a positive sequence and $T_\lambda(\hat\beta_{j,k})$ denotes a certain
transformation of the $\hat\beta_{j,k}$ which may depend on $\lambda$,

3. a reconstruction step of the form
$\hat f_\lambda = \sum_{j \in \Omega_n} \sum_k T_\lambda(\hat\beta_{j,k}) 1_{\{|\hat\beta_{j,k}| \ge \lambda_j\}} \psi_{j,k}$, where
$\Omega_n$ is a finite set of integers depending on the number $n$ of data.

Naturally, the performance of $\hat f_\lambda$ strongly depends on the choice of the threshold
$\lambda$. For the standard statistical models (regression, density, ...), the most common
choice is the universal threshold introduced by [21]. It can be expressed in the form
$\lambda^* = (\lambda_j^*)_j$ where $\lambda_j^* = c\sqrt{(\log n)/n}$ and $c > 0$ denotes a large enough constant. In
the literature, several techniques have been proposed to determine the 'best' adaptive
threshold. There are, for instance, the RiskShrink and SureShrink methods (see
[2"U| |2"T]), the cross-validation methods (see [33], [53] and [3T]), the methods based
on hypothesis tests (see [JJ and [2]), the Lepski methods (see [33]) and the Bayesian
methods (see [17] and [3]). Most of them are described in detail in [33] and [3J.
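As a rough illustration of the universal threshold and of the hard thresholding step described above, the following Python sketch may help. The function names, the toy coefficient values and the value of the constant `c` are assumptions made for this example only; the text merely requires `c` to be "large enough".

```python
import math

def universal_threshold(n, c):
    """Universal threshold c * sqrt(log(n) / n); c is a user-chosen constant."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log(n) > 0")
    return c * math.sqrt(math.log(n) / n)

def hard_threshold_all(coefficients, threshold):
    """Keep each estimated coefficient whose magnitude reaches the threshold, zero the rest."""
    return [b if abs(b) >= threshold else 0.0 for b in coefficients]

# Toy usage: hypothetical estimated coefficients and a hypothetical constant c = 1.
lam = universal_threshold(100, 1.0)
kept = hard_threshold_all([0.9, -0.05, 0.3, 0.1], lam)
```

The threshold is the same at every resolution level here, which is exactly what the multi-thresholding approach of this paper replaces by a level-dependent family.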

In the present paper, we propose to study the performance of an adaptive wavelet
estimator based on a convex combination of $\hat f_\lambda$'s. In the framework of nonparametric
density estimation and bounded regression estimation with random uniform design,
we prove that, in some sense, it is at least as good as the term-by-term thresholded
estimator $\hat f_\lambda$ defined with the 'best' threshold $\lambda$. In particular, we show that
this estimator is optimal, in the minimax sense, over all Besov balls under the $L^2$
risk. The proof is based on a non-adaptive minimax result proved by [19] and a
powerful oracle inequality satisfied by aggregation methods. There are two steps in
our approach: a first step, called the training step, where non-adaptive thresholded
wavelet estimators are constructed for different thresholds, and a second step, called
the learning step, where an aggregation scheme is worked out to realize the adaptation
to the smoothness.

The exact oracle inequality of Section 2 is given in a general framework. Two
aggregation procedures satisfy this oracle inequality: the well-known ERM (for
Empirical Risk Minimization) procedure (cf. [51], [38J and references therein) and an
exponential weighting aggregation scheme, which has been studied, among others, by
[3J, [S], [10], [3TJ and [32]. There is a recursive version of this scheme studied by [13J,
[33], [33J and [3B]. In the sequential prediction problem, weighted average predictions
with exponential weights have been widely studied (cf. e.g. [52] and P3]). A recent
result of [12] shows that the ERM procedure is suboptimal for strictly convex losses
(which is the case for density and regression estimation when the integrated squared
risk is used). Thus, in our case it is better to combine the $\hat f_\lambda$'s, for $\lambda$ lying in a
grid, using the aggregation procedure with exponential weights than using the ERM
procedure. Moreover, from a computational point of view, the aggregation scheme with
exponential weights does not require any minimization step, contrary to the ERM
procedure.

The paper is organized as follows. Section 2 presents general oracle inequalities
satisfied by two aggregation methods. Section 3 describes the main procedure of the
study and investigates its minimax performance over Besov balls for the $L^2$ risk.
All the proofs are postponed to the last section.



2 Oracle Inequalities 

2.1 Framework 

Let $(\mathcal{Z}, \mathcal{T})$ be a measurable space. Denote by $\mathcal{P}$ the set of all probability measures on
$(\mathcal{Z}, \mathcal{T})$. Let $F$ be a function from $\mathcal{P}$ with values in an algebra $\mathcal{F}$. Let $Z$ be a random
variable with values in $\mathcal{Z}$ and denote by $\pi$ its probability measure. Let $D_n$ be a
family of $n$ i.i.d. observations $Z_1, \dots, Z_n$ having the common probability measure
$\pi$. The probability measure $\pi$ is unknown. Our aim is to estimate $F(\pi)$ from the
observations $D_n$.

In our estimation problem, we assume that we have access to an "empirical risk".
It means that there exists $Q : \mathcal{Z} \times \mathcal{F} \to \mathbb{R}$ such that the risk of an estimate $f \in \mathcal{F}$
of $F(\pi)$ is of the form

$$A(f) = E[Q(Z, f)].$$

In what follows, we present several statistical problems which can be written in this
way. If the minimum over all $f$ in $\mathcal{F}$

$$A^* = \min_{f \in \mathcal{F}} A(f)$$

is achieved by at least one function, we denote by $f^*$ a minimizer in $\mathcal{F}$. In this paper
we will assume that $\min_{f \in \mathcal{F}} A(f)$ is achievable; otherwise we replace $f^*$ by $f_n^*$, an
element in $\mathcal{F}$ satisfying $A(f_n^*) \le \inf_{f \in \mathcal{F}} A(f) + n^{-1}$.

In most of the cases $f^*$ will be equal to our aim $F(\pi)$ up to some known additive
terms. We do not know the risk $A$, since $\pi$ is not available to the statistician; thus,
instead of minimizing $A$ over $\mathcal{F}$ we consider an empirical version of $A$ constructed
from the observations $D_n$. The main interest of such a framework is that we have
access to an empirical version of $A(f)$ for any $f \in \mathcal{F}$. It is denoted by

$$A_n(f) = \frac{1}{n} \sum_{i=1}^n Q(Z_i, f).$$


We exhibit three statistical models having the previous form of estimation. 

Bounded Regression: Take $\mathcal{Z} = \mathcal{X} \times [0,1]$, where $(\mathcal{X}, \mathcal{A})$ is a measurable
space, $Z = (X, Y)$ a couple of random variables on $\mathcal{Z}$, with probability distribution
$\pi$, such that $X$ takes its values in $\mathcal{X}$ and $Y$ takes its values in $[0, 1]$. We assume that
the conditional expectation $E[Y|X]$ exists. In the regression framework, we want to
estimate the regression function

$$f^*(x) = E[Y | X = x], \quad \forall x \in \mathcal{X}. \qquad (1)$$

Usually, the variable $Y$ is not an exact function of $X$. Given an input $X \in \mathcal{X}$,
we are not able to predict the exact value of the output $Y \in [0, 1]$. This issue can
be seen in the regression framework as a noisy estimation. It means that at each
point $X$ of the input set, the predicted label $Y$ is concentrated around $E[Y|X]$ up to
an additional noise with null mean denoted by $\zeta$. The regression model can then be
written as

$$Y = E[Y|X] + \zeta.$$

Take $\mathcal{F}$ the set of all measurable functions from $\mathcal{X}$ to $[0,1]$. Define
$\|f\|^2_{L^2(P^X)} = \int_{\mathcal{X}} f^2(x) dP^X(x)$ for all functions $f$ in $L^2(\mathcal{X}, \mathcal{A}, P^X)$, where $P^X$ is the probability
measure of $X$. Consider

$$Q((x,y),f) = (y - f(x))^2, \qquad (2)$$

for any $(x, y) \in \mathcal{X} \times \mathbb{R}$ and $f \in \mathcal{F}$. Pythagoras' theorem yields

$$A(f) = E[Q((X, Y), f)] = \|f^* - f\|^2_{L^2(P^X)} + E[\zeta^2].$$

Thus $f^*$ is a minimizer of $A(f)$ and $A^* = E[\zeta^2]$.

Density estimation: Let $(\mathcal{Z}, \mathcal{T}, \mu)$ be a measured space. Let $Z$ be a random
variable with values in $\mathcal{Z}$ and denote by $\pi$ its probability distribution. We assume
that $\pi$ is absolutely continuous w.r.t. $\mu$ and denote by $f^*$ one version of the
density. Consider $\mathcal{F}$ the set of all density functions on $(\mathcal{Z}, \mathcal{T}, \mu)$. We consider

$$Q(z,f) = -\log f(z),$$

for any $z \in \mathcal{Z}$ and $f \in \mathcal{F}$. We have

$$A(f) = E[Q(Z, f)] = K(f^*|f) - \int_{\mathcal{Z}} \log(f^*(z)) d\pi(z).$$

Thus, $f^*$ is a minimizer of $A(f)$ and $A^* = -\int_{\mathcal{Z}} \log(f^*(z)) d\pi(z)$.

Instead of using the Kullback-Leibler loss, one can use the quadratic loss. For this
setup, consider $\mathcal{F}$ the set $L^2(\mathcal{Z}, \mathcal{T}, \mu)$ of all measurable functions with an integrable
square. Define

$$Q(z,f) = \int_{\mathcal{Z}} f^2 d\mu - 2f(z), \qquad (3)$$

for any $z \in \mathcal{Z}$ and $f \in \mathcal{F}$. We have, for any $f \in \mathcal{F}$,

$$A(f) = E[Q(Z, f)] = \|f^* - f\|^2_{L^2(\mu)} - \int_{\mathcal{Z}} (f^*(z))^2 d\mu(z).$$

Thus, $f^*$ is a minimizer of $A(f)$ and $A^* = -\int_{\mathcal{Z}} (f^*(z))^2 d\mu(z)$.
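The risk identity for the quadratic loss (3) follows from completing the square. The short derivation below is a sketch of that step; it only uses that $f^*$ is the density of $\pi$ with respect to $\mu$, so that $E[f(Z)] = \int f f^* d\mu$.

```latex
\begin{aligned}
A(f) = E[Q(Z,f)]
  &= \int_{\mathcal{Z}} f^2 \, d\mu - 2 \int_{\mathcal{Z}} f f^* \, d\mu \\
  &= \int_{\mathcal{Z}} (f - f^*)^2 \, d\mu - \int_{\mathcal{Z}} (f^*)^2 \, d\mu
   = \|f^* - f\|^2_{L^2(\mu)} - \int_{\mathcal{Z}} (f^*(z))^2 \, d\mu(z).
\end{aligned}
```

The second term does not depend on $f$, which is why minimizing $A$ amounts to minimizing the $L^2(\mu)$ distance to $f^*$.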

Classification framework: Let $(\mathcal{X}, \mathcal{A})$ be a measurable space. We assume
that the space $\mathcal{Z} = \mathcal{X} \times \{-1, 1\}$ is endowed with an unknown probability measure
$\pi$. We consider a random variable $Z = (X, Y)$ with values in $\mathcal{Z}$ with probability
distribution $\pi$. We denote by $P^X$ the marginal of $\pi$ on $\mathcal{X}$ and by $\eta(x) = P(Y = 1 | X = x)$
the conditional probability function of $Y = 1$ knowing that $X = x$. Denote by $\mathcal{F}$
the set of all measurable functions from $\mathcal{X}$ to $\mathbb{R}$. Let $\phi$ be a function from $\mathbb{R}$ to $\mathbb{R}$.
For any $f \in \mathcal{F}$ consider the $\phi$-risk

$$A(f) = E[Q((X,Y),f)],$$

where the loss is given by $Q((x,y), f) = \phi(y f(x))$ for any $(x,y) \in \mathcal{X} \times \{-1, 1\}$.

Most of the time a minimizer $f^*$ of the $\phi$-risk $A$ over $\mathcal{F}$, or its sign, is equal to
the Bayes rule $f^*(x) = \mathrm{Sign}(2\eta(x) - 1), \forall x \in \mathcal{X}$ (cf. (SHJ)).
In this paper we obtain an oracle inequality in the general framework described
at the beginning of this Subsection. Then, we use it in the density estimation and
the bounded regression frameworks. For applications of this oracle inequality in the
classification setup, we refer to [JT] and [30].

Now, we introduce an assumption which improves the quality of estimation in our
framework. This assumption was first introduced by [13], for the problem of
discriminant analysis, and [50J, for the classification problem. With this assumption,
parametric rates of convergence can be achieved, for instance, in the classification
problem (cf. [5TJ], [IS]).

Margin Assumption (MA): The probability measure $\pi$ satisfies the margin
assumption MA($\kappa$, $c$, $\mathcal{F}_0$), where $\kappa \ge 1$, $c > 0$ and $\mathcal{F}_0$ is a subset of $\mathcal{F}$, if

$$E[(Q(Z, f) - Q(Z, f^*))^2] \le c(A(f) - A^*)^{1/\kappa},$$

for any function $f \in \mathcal{F}_0$.

In the bounded regression setup, it is easy to see that any probability distribution
$\pi$ on $\mathcal{X} \times [0, 1]$ naturally satisfies the margin assumption MA(1, 16, $\mathcal{F}_1$), where $\mathcal{F}_1$ is
the set of all measurable functions from $\mathcal{X}$ to $[0, 1]$. In density estimation with the
integrated squared risk, every probability measure $\pi$ on $(\mathcal{Z}, \mathcal{T})$ that is absolutely continuous
w.r.t. the measure $\mu$, with one version of its density a.s. bounded by a constant
$B \ge 1$, satisfies the margin assumption MA(1, $16B^2$, $\mathcal{F}_B$), where $\mathcal{F}_B$ is the set of all
non-negative functions $f \in L^2(\mathcal{Z}, \mathcal{T}, \mu)$ bounded by $B$.

Actually, the margin assumption is linked to the convexity of the underlying
loss. In density and regression estimation it is naturally satisfied with the best
margin parameter $\kappa = 1$, but for non-convex losses (for instance in classification)
this assumption does not hold naturally (cf. [12] for a discussion on the margin
assumption and for examples of losses which do not naturally satisfy the margin
assumption with parameter $\kappa = 1$).
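For the bounded regression claim, a short verification can be sketched as follows, using only the squared loss (2), the fact that $Y$, $f(X)$ and $f^*(X)$ all lie in $[0,1]$, and the identity $A(f) - A^* = \|f^* - f\|^2_{L^2(P^X)}$. The computation suggests that the constant 16 is not tight, since 4 already suffices in this sketch.

```latex
\begin{aligned}
Q(Z,f) - Q(Z,f^*) &= (Y - f(X))^2 - (Y - f^*(X))^2
                   = (f^*(X) - f(X))\,(2Y - f(X) - f^*(X)), \\
|2Y - f(X) - f^*(X)| &\le |Y - f(X)| + |Y - f^*(X)| \le 2, \\
E\big[(Q(Z,f) - Q(Z,f^*))^2\big] &\le 4\, E\big[(f^*(X) - f(X))^2\big]
                   = 4\,(A(f) - A^*) \le 16\,(A(f) - A^*).
\end{aligned}
```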

2.2 Aggregation Procedures 

Let us work with the notations introduced at the beginning of the previous Subsection.
The aggregation framework considered, among others, by [31], [51], [15], [15], [49J,
[5], [6] is the following: take $\mathcal{F}_0$ a finite subset of $\mathcal{F}$; our aim is to mimic (up to an
additive residual) the best function in $\mathcal{F}_0$ w.r.t. the risk $A$. For this, we consider
two aggregation procedures.

The Aggregation with Exponential Weights aggregate (AEW) over $\mathcal{F}_0$ is defined by

$$\tilde f_n^{(AEW)} = \sum_{f \in \mathcal{F}_0} w^{(n)}(f) f, \qquad (4)$$

where the exponential weights $w^{(n)}(f)$ are defined by

$$w^{(n)}(f) = \frac{\exp(-n A_n(f))}{\sum_{g \in \mathcal{F}_0} \exp(-n A_n(g))}, \quad \forall f \in \mathcal{F}_0. \qquad (5)$$

We consider the Empirical Risk Minimization procedure (ERM) over $\mathcal{F}_0$ defined by

$$\tilde f_n^{(ERM)} \in \mathrm{Arg}\min_{f \in \mathcal{F}_0} A_n(f).$$
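The two procedures of this Subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the squared loss and the toy sample are all hypothetical, and the weights are proportional to $\exp(-n A_n(f))$ with $A_n$ the empirical risk over the $n$ observations.

```python
import math

def empirical_risk(loss, f, sample):
    """A_n(f): average of the loss Q(Z_i, f) over the sample."""
    return sum(loss(z, f) for z in sample) / len(sample)

def aew_weights(risks, n):
    """Exponential weights proportional to exp(-n * A_n(f_j))."""
    # Subtracting the smallest risk leaves the normalised weights unchanged
    # and avoids numerical underflow when n is large.
    r_min = min(risks)
    raw = [math.exp(-n * (r - r_min)) for r in risks]
    total = sum(raw)
    return [v / total for v in raw]

def erm_index(risks):
    """Index of a candidate with the smallest empirical risk (ERM)."""
    return min(range(len(risks)), key=risks.__getitem__)

def squared_loss(z, f):
    x, y = z
    return (y - f(x)) ** 2

# Toy data and two hypothetical candidate functions.
sample = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
candidates = [lambda x: x, lambda x: 0.5]
risks = [empirical_risk(squared_loss, f, sample) for f in candidates]
weights = aew_weights(risks, len(sample))

def aew_aggregate(x):
    """AEW aggregate: weighted average of the candidates at x."""
    return sum(w * f(x) for w, f in zip(weights, candidates))
```

Note that the AEW aggregate needs only one pass to compute the risks and a normalisation, whereas ERM requires a search for the minimiser; with a finite dictionary both are cheap, but the aggregate is a convex combination rather than a single selected candidate.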

2.3 Oracle Inequalities 

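In this Subsection we state an exact oracle inequality satisfied by the ERM procedure
and the AEW procedure (in the convex case) in the general framework of the
beginning of Subsection 2.1. From this exact oracle inequality we deduce two other
oracle inequalities in the density estimation and the bounded regression frameworks.
We introduce a quantity which is going to be our residual term in the exact oracle
inequality. We consider

$$\gamma(n, M, \kappa, \mathcal{F}_0, \pi, Q) = \begin{cases} \left( \dfrac{B(\mathcal{F}_0, \pi, Q)^{1/\kappa} \log M}{\beta_1 n} \right)^{1/2} & \text{if } B(\mathcal{F}_0, \pi, Q) \ge \left( \dfrac{\log M}{\beta_1 n} \right)^{\kappa/(2\kappa-1)}, \\[2ex] \left( \dfrac{\log M}{\beta_2 n} \right)^{\kappa/(2\kappa-1)} & \text{otherwise,} \end{cases}$$

where $B(\mathcal{F}_0, \pi, Q)$ denotes $\min_{f \in \mathcal{F}_0} (A(f) - A^*)$, $\kappa \ge 1$ is the margin parameter, $\pi$ is
the underlying probability measure, $Q$ is the loss function,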



3 = min ( lQ S 2 1 J_\ (?) 

Pl mm V96cA' 8(4c + AT/3)' 576c/' U 

and 

. /I 3 log 2 1 

A " mm U ' "W 2(16c + A/3)' ' (8) 

where the constant $c > 0$ appears in MA($\kappa$, $c$, $\mathcal{F}_0$).
Theorem 1. Consider the general framework introduced in the beginning of
Subsection 2.1. Let $\mathcal{F}_0$ denote a finite subset of $M$ elements $f_1, \dots, f_M$ in $\mathcal{F}$, where $M \ge 2$
is an integer. Assume that the underlying probability measure $\pi$ satisfies the margin
assumption MA($\kappa$, $c$, $\mathcal{F}_0$) for some $\kappa \ge 1$, $c > 0$ and $|Q(Z,f) - Q(Z,f^*)| \le K$
a.s., for any $f \in \mathcal{F}_0$, where $K \ge 1$ is a constant. The Empirical Risk Minimization
procedure satisfies

$$E[A(\tilde f_n^{(ERM)}) - A^*] \le \min_{j=1,\dots,M} (A(f_j) - A^*) + 4\gamma(n, M, \kappa, \mathcal{F}_0, \pi, Q).$$

Moreover, if $f \mapsto Q(z, f)$ is convex for $\pi$-almost every $z \in \mathcal{Z}$, then the AEW procedure
satisfies the same oracle inequality as the ERM procedure.

Now, we give two corollaries of Theorem 1 in the density estimation and bounded
regression frameworks.

Corollary 1. Consider the bounded regression setup. Let $f_1, \dots, f_M$ be $M$ functions
on $\mathcal{X}$ with values in $[0, 1]$. Let $\tilde f_n$ denote either the ERM or the AEW procedure. For
$\beta_2$ defined in (8) and for any $\epsilon > 0$, we have

$$E[\|f^* - \tilde f_n\|^2_{L^2(P^X)}] \le (1 + \epsilon) \min_{j=1,\dots,M} \|f^* - f_j\|^2_{L^2(P^X)} + \frac{4 \log M}{\epsilon \beta_2 n}.$$

Corollary 2. Consider the density estimation framework. Assume that the underlying
density function $f^*$ to estimate is bounded by $B \ge 1$. Let $f_1, \dots, f_M$ be $M$
functions bounded from above and below by $B$. Let $\tilde f_n$ denote either the ERM or the
AEW procedure. For $\beta_2$ defined in (8) and any $\epsilon > 0$, we have

$$E[\|f^* - \tilde f_n\|^2_{L^2(\mu)}] \le (1 + \epsilon) \min_{j=1,\dots,M} \|f^* - f_j\|^2_{L^2(\mu)} + \frac{4 \log M}{\epsilon \beta_2 n}. \qquad (9)$$

In both of the last Corollaries, the ERM and the AEW procedures can both be
used to mimic the best $f_j$ among the $f_j$'s. Nevertheless, from a computational point
of view the AEW procedure does not require any minimization step, contrary to
the ERM procedure. Moreover, from a theoretical point of view the ERM procedure
cannot mimic the best $f_j$ among the $f_j$'s as fast as the cumulative aggregate with
exponential weights (it is an average of AEW procedures). For a comparison between
these procedures we refer to |42j. The constants of aggregation multiplying the
residual term in Theorem 1 and in both Corollaries come from the
proof and are certainly not optimal. We did not make any serious attempt to optimize
them.
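One way to see where the factor $(1+\epsilon)$ and the $1/\epsilon$ in the residual of the Corollaries come from is sketched below. It covers only the regime in which the residual of Theorem 1 has the square-root form obtained in case (C1) of the proof, with margin parameter $\kappa = 1$, and uses the elementary inequality $2\sqrt{xy} \le x + y$. The shorthand $B$ and $a$ is introduced only for this sketch.

```latex
% Shorthand (assumed for this sketch only):
%   B = \min_{j} (A(f_j) - A^*),   a = (\log M) / (\beta_1 n).
4\sqrt{B a} \;=\; 2\sqrt{(\epsilon B)\,(4a/\epsilon)} \;\le\; \epsilon B + \frac{4a}{\epsilon},
\qquad\text{hence}\qquad
B + 4\sqrt{B a} \;\le\; (1+\epsilon)\, B + \frac{4 \log M}{\epsilon\, \beta_1 n}.
```

Since the proof uses $\beta_1 \ge 2\beta_2$, replacing $\beta_1$ by $\beta_2$ in the last term only enlarges the bound.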
3 Multi-thresholding wavelet estimator

In the present section, we propose an adaptive estimator constructed from
aggregation techniques and wavelet thresholding methods. For the density model and the
regression model with uniform random design, we show that it is optimal in the
minimax sense over a wide range of function spaces.

3.1 Wavelets and Besov balls 

We consider an orthonormal wavelet basis generated by dilation and translation of
a compactly supported "father" wavelet $\phi$ and a compactly supported "mother"
wavelet $\psi$. For the purposes of this paper, we use the periodized wavelet bases on
the unit interval. Let

$$\phi_{j,k} = 2^{j/2} \phi(2^j x - k), \qquad \psi_{j,k} = 2^{j/2} \psi(2^j x - k)$$

be the elements of the wavelet basis and

$$\phi^{per}_{j,k}(x) = \sum_{l \in \mathbb{Z}} \phi_{j,k}(x - l), \qquad \psi^{per}_{j,k}(x) = \sum_{l \in \mathbb{Z}} \psi_{j,k}(x - l)$$

their periodized versions, defined for any $x \in [0, 1]$, $j \in \mathbb{N}$ and $k \in \{0, \dots, 2^j - 1\}$.
There exists an integer $\tau$ such that the collection $\zeta$ defined by
$\zeta = \{\phi^{per}_{\tau,k}, k = 0, \dots, 2^\tau - 1; \psi^{per}_{j,k}, j = \tau, \dots, \infty, k = 0, \dots, 2^j - 1\}$
constitutes an orthonormal basis of $L^2([0, 1])$.
In what follows, the superscript "per" will be suppressed from the notations for
convenience. For any integer $l \ge \tau$, a square-integrable function $f^*$ on $[0, 1]$ can be
expanded into a wavelet series

$$f^*(x) = \sum_{k=0}^{2^l - 1} \alpha_{l,k} \phi_{l,k}(x) + \sum_{j=l}^{\infty} \sum_{k=0}^{2^j - 1} \beta_{j,k} \psi_{j,k}(x),$$

where $\alpha_{j,k} = \int_0^1 f^*(x) \phi_{j,k}(x) dx$ and $\beta_{j,k} = \int_0^1 f^*(x) \psi_{j,k}(x) dx$. Further details on
wavelet theory can be found in [33] and [18].

Now, let us define the main function spaces of the study. Let $M \in (0, \infty)$,
$s \in (0, N)$, $p \in [1, \infty)$ and $q \in [1, \infty)$. Let us set $\beta_{\tau-1,k} = \alpha_{\tau,k}$. We say that a
function $f^*$ belongs to the Besov ball $B^s_{p,q}(M)$ if and only if the associated wavelet
coefficients satisfy

$$\left[ \sum_{j=\tau-1}^{\infty} \left[ 2^{j(s+1/2-1/p)} \left( \sum_{k=0}^{2^j - 1} |\beta_{j,k}|^p \right)^{1/p} \right]^q \right]^{1/q} \le M, \quad \text{if } q \in [1, \infty),$$

with the usual modification if $q = \infty$. We work with the Besov balls because of their
exceptional expressive power. For a particular choice of parameters $s$, $p$ and $q$, they
contain the Hölder and Sobolev balls (see pH]).

3.2 Term-by-term thresholded estimator 

In this Subsection, we consider the estimation of an unknown function $f^*$ in $L^2([0, 1])$
in a general situation. We only assume that we have $n$ observations gathered in the
data set $D_n$ from which we are able to estimate the wavelet coefficients $\alpha_{j,k}$ and
$\beta_{j,k}$ of $f^*$ in the basis $\zeta$. We denote by $\hat\alpha_{j,k}$ and $\hat\beta_{j,k}$ such estimates. Finally, let us
mention that all the constants of our study are independent of $f^*$ and $n$.

Definition 1 (Term-by-term thresholded estimator). Let $j_1$ be an integer satisfying
$(n/\log n) \le 2^{j_1} < 2(n/\log n)$. For any integer $l \ge \tau$, let $\lambda = (\lambda_\tau, \dots, \lambda_{j_1})$ be a vector
of positive integers. Let us consider the estimator $\hat f_\lambda : D_n \times [0, 1] \to \mathbb{R}$ defined by

$$\hat f_\lambda(D_n, x) = \sum_{k=0}^{2^\tau - 1} \hat\alpha_{\tau,k} \phi_{\tau,k}(x) + \sum_{j=\tau}^{j_1} \sum_{k=0}^{2^j - 1} T_{\lambda_j}(\hat\beta_{j,k}) \psi_{j,k}(x), \qquad (10)$$

where for all $u \in (0, \infty)$ the operator $T_u$ is such that there exist two constants
$C_1, C_2 > 0$ satisfying

$$|T_u(x) - y|^2 \le C_1 \left( \min(y, C_2 u)^2 + |x - y|^2 1_{\{|x - y| \ge 2^{-1} u\}} \right), \qquad (11)$$

for any $x \in \mathbb{R}$ and $y \in \mathbb{R}$.

The inequality (11) holds for the hard thresholding rule $T_u^{hard}(x) = x 1_{\{|x| \ge u\}}$, the
soft thresholding rule $T_u^{soft}(x) = \mathrm{sign}(x)(|x| - u) 1_{\{|x| \ge u\}}$ (see [21], [22] and [19]) and
the non-negative garrote thresholding rule $T_u^{NG}(x) = (x - u^2/x) 1_{\{|x| \ge u\}}$ (see [26]).

If we consider the minimax point of view over Besov balls under the integrated
squared risk, then [19] gives conditions on $\hat\alpha_{j,k}$, $\hat\beta_{j,k}$ and the threshold $\lambda$ such that
the estimator $\hat f_\lambda(D_n, .)$ defined by (10) is optimal for numerous statistical models.
This result is recalled in Theorem 2 below.
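The three thresholding rules named in this Subsection (hard, soft and non-negative garrote) can be written as scalar functions. The sketch below is only illustrative; the function names are chosen for this example, and the threshold `u` is assumed to be strictly positive, as in Definition 1.

```python
import math

def hard_threshold(x, u):
    """Hard rule: keep x unchanged when |x| >= u, otherwise return 0."""
    return x if abs(x) >= u else 0.0

def soft_threshold(x, u):
    """Soft rule: shrink |x| by u towards zero when |x| >= u, otherwise return 0."""
    if abs(x) < u:
        return 0.0
    return math.copysign(abs(x) - u, x)

def garrote_threshold(x, u):
    """Non-negative garrote rule: x - u^2 / x when |x| >= u, otherwise return 0."""
    if abs(x) < u or x == 0.0:
        return 0.0
    return x - (u * u) / x
```

For a coefficient well above the threshold the garrote output lies between the soft and the hard outputs, which is one reason it is sometimes presented as a compromise between the two.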
Theorem 2 (Delyon and Juditsky (1996)). Let us consider the general statistical
framework described in the beginning of the present section. Suppose that the two
following assumptions hold.

• Moments inequality: There exists a constant $C > 0$ such that, for any
$j \in \{\tau - 1, \dots, j_1\}$, $k \in \{0, \dots, 2^j - 1\}$ and $n$ large enough, we have

$$E(|\hat\beta_{j,k} - \beta_{j,k}|^4) \le C n^{-2}, \quad \text{where we take } \hat\beta_{\tau-1,k} = \hat\alpha_{\tau,k}. \qquad (12)$$

• Large deviation inequality: There exist two constants $C > 0$ and $\rho^* > 0$ such
that, for any $a, j \in \{\tau, \dots, j_1\}$, $k \in \{0, \dots, 2^j - 1\}$ and $n$ large enough, we have

$$P\left( 2\sqrt{n} |\hat\beta_{j,k} - \beta_{j,k}| \ge \rho^* \sqrt{a} \right) \le C 2^{-4a}. \qquad (13)$$

Let us consider the term-by-term thresholded estimator $\hat f_{v_{j_s}}(D_n, .)$ defined by (10)
with the threshold

$$v_{j_s} = (\rho^* (j - j_s)_+)_{j=\tau,\dots,j_1},$$

where $j_s$ is an integer such that $n^{1/(1+2s)} \le 2^{j_s} < 2 n^{1/(1+2s)}$. Then, there exists a
constant $C > 0$ such that, for any $p \in [1, \infty]$, $s \in (1/p, N]$, $q \in [1, \infty]$ and $n$ large
enough, we have:

$$\sup_{f^* \in B^s_{p,q}(L)} E[\|\hat f_{v_{j_s}}(D_n, .) - f^*\|^2_{L^2([0,1])}] \le C n^{-2s/(1+2s)}.$$

The rate of convergence $V_n = n^{-2s/(1+2s)}$ is minimax for numerous statistical
models, where $s$ is a regularity parameter. For the density model and the regression
model with uniform design, we refer the reader to [19] for further details about the
choice of the estimator $\hat\beta_{j,k}$ and the value of the thresholding constant $\rho^*$. Starting
from this non-adaptive result, we use aggregation methods to construct an adaptive
estimator at least as good in the minimax sense as $\hat f_{v_{j_s}}(D_n, .)$.

3.3 Multi-thresholding estimator 

Let us divide our observations $D_n$ into two disjoint subsamples: $D_m$, of size $m$, made
of the first $m$ observations, and $D^{(l)}$, of size $l$, made of the last remaining observations,
where we take

$$l = [n/\log n] \quad \text{and} \quad m = n - l.$$

The first subsample $D_m$, sometimes called the "training sample", is used to construct
a family of estimators (in our case, thresholded estimators) and the second
subsample $D^{(l)}$, called the "learning sample", is used to construct the weights of the
aggregation procedure.
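The split can be sketched as follows. Reading the bracket in $l = [n/\log n]$ as the integer part is an assumption of this example, as is the function name; the natural logarithm is used.

```python
import math

def split_sample(data):
    """Split data into the first m observations (training) and the last l (learning)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations so that both parts are non-empty")
    l = int(n / math.log(n))  # assumption: [.] denotes the integer part
    m = n - l
    return data[:m], data[m:]

# Toy usage with 100 placeholder observations.
training, learning = split_sample(list(range(100)))
```

With $n = 100$ this gives $l = 21$ and $m = 79$, so most of the data goes to the construction of the thresholded estimators, in line with the practical advice of Remark 1.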

Remark 1. From a theoretical point of view we can take $m = l$, which means that
we use as many observations for the estimation step as for the learning step. But
in practice it is better to use a greater part of the observations for the construction
of the estimators and the last observations for the aggregation procedure, because
if the base estimators that we aggregate are not good, then the obtained aggregate
is likely to be as bad as the base estimators. Another interesting point is that we
can split the whole sample $D_n$ in many different ways. For instance, we can take $m$
observations randomly in $D_n$ to form the training subsample and the remaining
observations for the learning subsample. We can also take an average of different
aggregates constructed from different splits of the initial sample $D_n$; by a simple
convexity argument it is easy to prove that the averaged aggregate has a risk no larger
than the average of the risks of the aggregates constructed from a single split.

Definition 2. Let us consider the term-by-term thresholded estimator described in
(10). Assume that we want to estimate a function $f^*$ from $[0, 1]$ with values in $[a, b]$.
Consider the projection function

$$h_{a,b}(y) = \max(a, \min(y, b)), \quad \forall y \in \mathbb{R}. \qquad (14)$$

We define the multi-thresholding estimator $\tilde f_n : [0, 1] \to [a, b]$ at a point $x \in [0, 1]$
by the following aggregate

$$\tilde f_n(x) = \sum_{u \in \Lambda_n} w^{(l)}(h_{a,b}(\hat f_{v_u}(D_m, .))) \, h_{a,b}(\hat f_{v_u}(D_m, x)), \qquad (15)$$

where $\Lambda_n = \{0, \dots, \log n\}$, $v_u = (\rho (j - u)_+)_{j=\tau,\dots,j_1}$, $\forall u \in \Lambda_n$, and $\rho$ is a positive
constant depending on the model worked out and

$$w^{(l)}(h_{a,b}(\hat f_{v_u}(D_m, .))) = \frac{\exp\left(-l A^{(l)}(h_{a,b}(\hat f_{v_u}(D_m, .)))\right)}{\sum_{\gamma \in \Lambda_n} \exp\left(-l A^{(l)}(h_{a,b}(\hat f_{v_\gamma}(D_m, .)))\right)}, \quad \forall u \in \Lambda_n,$$

where $A^{(l)}(f) = \frac{1}{l} \sum_{i=m+1}^{n} Q(Z_i, f)$ is the empirical risk constructed from the $l$ last
observations, for any function $f$ and for the choice of a loss function $Q$ depending
on the model considered (cf. (2) and (3) for examples).

The principle of the construction of the multi-thresholding estimator $\tilde f_n$ is to
use aggregation techniques to easily construct an adaptive optimal estimator of $f^*$. It
realizes a kind of 'adaptation to the threshold' by selecting the best threshold $v_u$ for
$u$ ranging over the set $\Lambda_n$. Since we know that there exists an element in $\Lambda_n$, depending
on the regularity of $f^*$, such that the non-adaptive estimator $\hat f_{v_u}(D_m, .)$ is optimal
in the minimax sense (see Theorem 2), the multi-thresholding estimator is optimal
independently of the regularity of $f^*$.
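The aggregation step of the multi-thresholding estimator can be sketched as below. The sketch takes the family of thresholded estimators as given (here two hypothetical stand-in functions rather than actual wavelet estimators), projects each onto $[a, b]$, computes its empirical risk on the learning sample, and forms the exponentially weighted average. All names, the squared loss and the toy data are assumptions of this example.

```python
import math

def clip(y, a, b):
    """Projection of y onto the interval [a, b]."""
    return max(a, min(y, b))

def multi_threshold_aggregate(estimators, learn_sample, loss, a, b):
    """Exponentially weighted average of clipped estimators; weights use the learning sample."""
    l = len(learn_sample)
    # Bind each estimator through a default argument so every closure keeps its own f.
    clipped = [lambda x, f=f: clip(f(x), a, b) for f in estimators]
    risks = [sum(loss(z, g) for z in learn_sample) / l for g in clipped]
    r_min = min(risks)  # shift for numerical stability; normalised weights are unchanged
    raw = [math.exp(-l * (r - r_min)) for r in risks]
    total = sum(raw)
    weights = [v / total for v in raw]

    def aggregate(x):
        return sum(w * g(x) for w, g in zip(weights, clipped))

    return aggregate, weights

def squared_loss(z, f):
    x, y = z
    return (y - f(x)) ** 2

# Hypothetical stand-ins for thresholded estimators built on the training sample.
estimators = [lambda x: 2.0, lambda x: x]
learn_sample = [(0.2, 0.2), (0.8, 0.8)]
aggregate, weights = multi_threshold_aggregate(estimators, learn_sample, squared_loss, 0.0, 1.0)
```

Because the output is a convex combination of functions with values in $[a, b]$, the aggregate itself takes values in $[a, b]$, and no minimization over the family is needed.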



4 Performances of the multi-thresholding estimator

This section is devoted to the minimax performance of the multi-thresholding
estimator defined in (15) under the $L^2([0, 1])$ risk over Besov balls. Firstly, we consider
the framework of the density model. Secondly, we focus our attention on the bounded
regression with uniform random design. Finally, we compare these results with some
well-known wavelet thresholded procedures.



4.1 Density model 

In the density estimation model, Theorem 3 below investigates rates of convergence
achieved by the multi-thresholding estimator (defined by (15)) under the $L^2([0,1])$
risk over Besov balls.

Theorem 3. Let us consider the problem of estimating $f^*$ from the density model.
Assume that there exists $B \ge 1$ such that the underlying density function $f^*$ to
estimate is bounded by $B$. Let us consider the multi-thresholding estimator defined
in (15) where we take $a = 0$, $b = B$, $\rho$ such that

$$\rho^2 \ge 4 (\log 2) \left( 8B + (8\rho/(3\sqrt{2}))(\|\psi\|_\infty + B) \right)$$

and

$$\hat\alpha_{j,k} = \frac{1}{n} \sum_{i=1}^n \phi_{j,k}(X_i), \qquad \hat\beta_{j,k} = \frac{1}{n} \sum_{i=1}^n \psi_{j,k}(X_i). \qquad (16)$$

Then, there exists a constant $C > 0$ such that

$$\sup_{f^* \in B^s_{p,q}(L)} E[\|\tilde f_n - f^*\|^2_{L^2([0,1])}] \le C n^{-2s/(2s+1)},$$

for any $p \in [1, \infty]$, $s \in (p^{-1}, N]$, $q \in [1, \infty]$ and integer $n$.

The rate of convergence $V_n = n^{-2s/(1+2s)}$ is minimax over $B^s_{p,q}(L)$. Further details
about the minimax rate of convergence over Besov balls under the $L^2([0, 1])$ risk for
the density model can be found in [19] and [29]. For further details about
density estimation via adaptive wavelet thresholded estimators, see [23], [19] and [27]. See
also [30] for a practical study.
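The empirical coefficients in (16) are plain sample means of the wavelet functions evaluated at the observations. The sketch below illustrates this with the Haar wavelet, which is an assumption of the example (the paper only requires compactly supported wavelets); for Haar functions on $[0,1)$ no periodization is needed because each $\psi_{j,k}$ is already supported inside the unit interval.

```python
def haar_phi(j, k, x):
    """Haar scaling function phi_{j,k}: 2^{j/2} on [k/2^j, (k+1)/2^j), 0 elsewhere."""
    t = (2 ** j) * x - k
    return 2 ** (j / 2) if 0 <= t < 1 else 0.0

def haar_psi(j, k, x):
    """Haar wavelet psi_{j,k}: +2^{j/2} on the first half of its support, -2^{j/2} on the second."""
    t = (2 ** j) * x - k
    if 0 <= t < 0.5:
        return 2 ** (j / 2)
    if 0.5 <= t < 1:
        return -(2 ** (j / 2))
    return 0.0

def empirical_coefficient(wavelet, j, k, xs):
    """Sample mean of wavelet_{j,k}(X_i), as in the density-model estimates (16)."""
    return sum(wavelet(j, k, x) for x in xs) / len(xs)

# Toy sample on [0, 1).
xs = [0.1, 0.2, 0.6, 0.9]
alpha_00 = empirical_coefficient(haar_phi, 0, 0, xs)
beta_00 = empirical_coefficient(haar_psi, 0, 0, xs)
beta_10 = empirical_coefficient(haar_psi, 1, 0, xs)
```

In the bounded regression model the same sample means are used with each term multiplied by the response $Y_i$.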



4.2 Bounded regression 

In the framework of the bounded regression model with uniform random design,
Theorem 4 below investigates the rate of convergence achieved by the multi-thresholding
estimator defined by (15) under the $L^2([0, 1])$ risk over Besov balls.

Theorem 4. Let us consider the problem of estimating the regression function $f^*$
in the bounded regression model with random uniform design. Let us consider the
multi-thresholding estimator (15) with $\rho$ such that

$$\rho^2 \ge 4 (\log 2) \left( 8 + (8\rho/(3\sqrt{2}))(\|\psi\|_\infty + 1) \right)$$

and

$$\hat\alpha_{j,k} = \frac{1}{n} \sum_{i=1}^n Y_i \phi_{j,k}(X_i), \qquad \hat\beta_{j,k} = \frac{1}{n} \sum_{i=1}^n Y_i \psi_{j,k}(X_i). \qquad (17)$$

Then, there exists a constant $C > 0$ such that, for any $p \in [1, \infty]$, $s \in (p^{-1}, N]$,
$q \in [1, \infty]$ and integer $n$, we have

$$\sup_{f^* \in B^s_{p,q}(L)} E[\|\tilde f_n - f^*\|^2_{L^2([0,1])}] \le C n^{-2s/(2s+1)}.$$
The rate of convergence $V_n = n^{-2s/(1+2s)}$ is minimax over $B^s_{p,q}(L)$. The
multi-thresholding estimator has better minimax properties than several other wavelet
estimators developed in the literature. To the authors' knowledge, the results
obtained, for instance, by the hard thresholded estimator (see [21]), by the global
wavelet block thresholded estimator (see [37]), by the localized wavelet block
thresholded estimator (see 0121 HQ], US [27], [22 [25], [H] and p]) and, in particular,
the penalized Blockwise Stein method (see [H]) are worse than the one obtained by
the multi-thresholding estimator and stated in Theorems 3 and 4. This is because,
unlike those works, we obtain the optimal rate of convergence without
any extra logarithmic factor.

In fact, the multi-thresholding estimator has minimax performance similar to that of
the empirical Bayes wavelet methods (see [55] and [32]) and of several term-by-term
wavelet thresholded estimators defined with a random threshold (see [33] and [7J).

Finally, it is important to mention that the multi-thresholding estimator does
not need any minimization step and is relatively easy to implement.

5 Proofs 

Proof of Theorem 1. We recall the notations of the general framework introduced
in the beginning of Subsection 2.1. Consider a loss function $Q : \mathcal{Z} \times \mathcal{F} \to \mathbb{R}$, the
risk $A(f) = E[Q(Z, f)]$, the minimum risk $A^* = \min_{f \in \mathcal{F}} A(f)$, where we assume,
w.l.o.g., that it is achieved by an element $f^*$ in $\mathcal{F}$, and the empirical risk
$A_n(f) = (1/n) \sum_{i=1}^n Q(Z_i, f)$, for any $f \in \mathcal{F}$. The following proof is a generalization of the
proof of Theorem 1 in [39J.

We first start by a 'linearization' of the risk. Consider the convex set

$$C = \left\{ (\theta_1, \dots, \theta_M) : \theta_j \ge 0 \text{ and } \sum_{j=1}^M \theta_j = 1 \right\}$$

and define the following functions on $C$

$$\tilde A(\theta) = \sum_{j=1}^M \theta_j A(f_j) \quad \text{and} \quad \tilde A_n(\theta) = \sum_{j=1}^M \theta_j A_n(f_j),$$

which are linear versions of the risk $A$ and its empirical version $A_n$.
Using the Lagrange method of optimization we find that the exponential weights
$w = (w^{(n)}(f_j))_{1 \le j \le M}$ are the unique solution of the minimization problem

$$\min \left( \tilde A_n(\theta) + \frac{1}{n} \sum_{j=1}^M \theta_j \log \theta_j : (\theta_1, \dots, \theta_M) \in C \right),$$

where we use the convention $0 \log 0 = 0$. Take $\hat\jmath \in \{1, \dots, M\}$ such that
$A_n(f_{\hat\jmath}) = \min_{j=1,\dots,M} A_n(f_j)$. The vector of exponential weights $w$ satisfies

$$\tilde A_n(w) \le \tilde A_n(e_{\hat\jmath}) + \frac{\log M}{n}, \qquad (18)$$

where $e_j$ denotes the vector in $C$ with 1 for $j$-th coordinate (and 0 elsewhere).

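Let $\epsilon > 0$. Denote by $\tilde A_C$ the minimum $\min_{\theta \in C} \tilde A(\theta)$. We consider the subset of $C$

$$\mathcal{D} \stackrel{def}{=} \left\{ \theta \in C : \tilde A(\theta) > \tilde A_C + 2\epsilon \right\}.$$

Let $x > 0$. If

$$\sup_{\theta \in \mathcal{D}} \frac{\tilde A(\theta) - A^* - (\tilde A_n(\theta) - A_n(f^*))}{\tilde A(\theta) - A^* + x} \le \frac{\epsilon}{\tilde A_C - A^* + 2\epsilon + x},$$

then for any $\theta \in \mathcal{D}$, we have

$$\tilde A_n(\theta) - A_n(f^*) \ge \tilde A(\theta) - A^* - \frac{\epsilon (\tilde A(\theta) - A^* + x)}{\tilde A_C - A^* + 2\epsilon + x} \ge \tilde A_C - A^* + \epsilon,$$

because $\tilde A(\theta) - A^* \ge \tilde A_C - A^* + 2\epsilon$. Hence,

$$P\left[ \inf_{\theta \in \mathcal{D}} \left( \tilde A_n(\theta) - A_n(f^*) \right) < \tilde A_C - A^* + \epsilon \right] \le P\left[ \sup_{\theta \in \mathcal{D}} \frac{\tilde A(\theta) - A^* - (\tilde A_n(\theta) - A_n(f^*))}{\tilde A(\theta) - A^* + x} > \frac{\epsilon}{\tilde A_C - A^* + 2\epsilon + x} \right].$$

Observe that a linear function achieves its extreme values over a convex polygon at
vertices of the polygon. Thus, for $j_0 \in \{1, \dots, M\}$ such that
$\tilde A(e_{j_0}) = \min_{j=1,\dots,M} \tilde A(e_j)$ $(= \min_{j=1,\dots,M} A(f_j))$, we have $\tilde A(e_{j_0}) = \min_{\theta \in C} \tilde A(\theta)$. We obtain
the last equality by linearity of $\tilde A$ and the convexity of $C$. Let $\hat w$ denote either the
exponential weights $w$ or $e_{\hat\jmath}$. According to (18), we have

$$\tilde A_n(\hat w) \le \min_{j=1,\dots,M} \tilde A_n(e_j) + \frac{\log M}{n} \le \tilde A_n(e_{j_0}) + \frac{\log M}{n}.$$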
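So, if $\tilde A(\hat w) > \tilde A_C + 2\epsilon$ then $\hat w \in \mathcal{D}$ and thus, there exists $\theta \in \mathcal{D}$ such that
$\tilde A_n(\theta) - A_n(f^*) \le \tilde A_n(e_{j_0}) - A_n(f^*) + (\log M)/n$. Hence, we have

$$P\left[ \tilde A(\hat w) > \tilde A_C + 2\epsilon \right] \le P\left[ \inf_{\theta \in \mathcal{D}} \tilde A_n(\theta) - A_n(f^*) \le \tilde A_n(e_{j_0}) - A_n(f^*) + \frac{\log M}{n} \right]$$

$$\le P\left[ \inf_{\theta \in \mathcal{D}} \tilde A_n(\theta) - A_n(f^*) < \tilde A_C - A^* + \epsilon \right] + P\left[ \tilde A_n(e_{j_0}) - A_n(f^*) \ge \tilde A_C - A^* + \epsilon - \frac{\log M}{n} \right]$$

$$\le P\left[ \sup_{\theta \in C} \frac{\tilde A(\theta) - A^* - (\tilde A_n(\theta) - A_n(f^*))}{\tilde A(\theta) - A^* + x} > \frac{\epsilon}{\tilde A_C - A^* + 2\epsilon + x} \right] + P\left[ \tilde A_n(e_{j_0}) - A_n(f^*) \ge \tilde A_C - A^* + \epsilon - \frac{\log M}{n} \right].$$

If we assume that

$$\sup_{\theta \in C} \frac{\tilde A(\theta) - A^* - (\tilde A_n(\theta) - A_n(f^*))}{\tilde A(\theta) - A^* + x} > \frac{\epsilon}{\tilde A_C - A^* + 2\epsilon + x},$$

then there exists $\theta^{(0)} = (\theta^{(0)}_1, \dots, \theta^{(0)}_M) \in C$ such that

$$\frac{\tilde A(\theta^{(0)}) - A^* - (\tilde A_n(\theta^{(0)}) - A_n(f^*))}{\tilde A(\theta^{(0)}) - A^* + x} > \frac{\epsilon}{\tilde A_C - A^* + 2\epsilon + x}.$$

The linearity of $\tilde A$ yields

$$\frac{\tilde A(\theta^{(0)}) - A^* - (\tilde A_n(\theta^{(0)}) - A_n(f^*))}{\tilde A(\theta^{(0)}) - A^* + x} = \frac{\sum_{j=1}^M \theta^{(0)}_j \left[ A(f_j) - A^* - (A_n(f_j) - A_n(f^*)) \right]}{\sum_{j=1}^M \theta^{(0)}_j \left[ A(f_j) - A^* + x \right]}$$

and since, for any numbers $a_1, \dots, a_M$ and positive numbers $b_1, \dots, b_M$, we have

$$\frac{\sum_{j=1}^M a_j}{\sum_{j=1}^M b_j} \le \max_{j=1,\dots,M} \left( \frac{a_j}{b_j} \right),$$

then we obtain

$$\max_{j=1,\dots,M} \frac{A(f_j) - A^* - (A_n(f_j) - A_n(f^*))}{A(f_j) - A^* + x} > \frac{\epsilon}{A_{\mathcal{F}_0} - A^* + 2\epsilon + x},$$

where $A_{\mathcal{F}_0} \stackrel{def}{=} \min_{j=1,\dots,M} A(f_j)$ $(= \tilde A_C)$.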
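Now, we use the relative concentration inequality of Lemma 1 to obtain

$$P\left[ \max_{j=1,\dots,M} \frac{A(f_j) - A^* - (A_n(f_j) - A_n(f^*))}{A(f_j) - A^* + x} > \frac{\epsilon}{A_{\mathcal{F}_0} - A^* + 2\epsilon + x} \right]$$

$$\le M \left( 1 + \frac{4c (A_{\mathcal{F}_0} - A^* + 2\epsilon + x)^2 x^{1/\kappa}}{n (\epsilon x)^2} \right) \exp\left( - \frac{n (\epsilon x)^2}{4c (A_{\mathcal{F}_0} - A^* + 2\epsilon + x)^2 x^{1/\kappa}} \right)$$

$$+ M \left( 1 + \frac{4K (A_{\mathcal{F}_0} - A^* + 2\epsilon + x)}{3 n \epsilon x} \right) \exp\left( - \frac{3 n \epsilon x}{4K (A_{\mathcal{F}_0} - A^* + 2\epsilon + x)} \right).$$

Using the margin assumption MA($\kappa$, $c$, $\mathcal{F}_0$) to upper bound the variance term and
applying Bernstein's inequality, we get

$$P\left[ \tilde A_n(e_{j_0}) - A_n(f^*) \ge A_{\mathcal{F}_0} - A^* + \epsilon - \frac{\log M}{n} \right] \le \exp\left( - \frac{n (\epsilon - (\log M)/n)^2}{2c (A_{\mathcal{F}_0} - A^*)^{1/\kappa} + (2K/3)(\epsilon - (\log M)/n)} \right)$$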

for any e > (log M)/n. From now, we take x = Ajr — A* + 2e, then, for any 
(log M)/n < e < 1, we have 

n(e — logM/n) 2 



P L4(w) > A To +2e < exp 



M 1 



2c(A To - A*) 1 /* + (2K/3)(e - (log M)/n) 



32c(Af - A* + 2e) 1 / K 



exp 



32c(Ajr -^* + 2e)V« 



+ M 1 + 



32 
3ne 



exp 



3ne 
^2" 



If $\hat w$ denotes $e_{\hat\jmath}$ then $\tilde A(\hat w) = \tilde A(e_{\hat\jmath}) = A(\tilde f_n^{(ERM)})$. If $\hat w$ denotes the vector of
exponential weights $w$ and if $f \mapsto Q(z, f)$ is convex for $\pi$-almost every $z \in \mathcal{Z}$, then
$\tilde A(\hat w) = \tilde A(w) \ge A(\tilde f_n^{(AEW)})$. If $f \mapsto Q(z, f)$ is assumed to be convex for $\pi$-almost every
$z \in \mathcal{Z}$ then let $\tilde f_n$ denote either the ERM procedure or the AEW procedure; otherwise,
let $\tilde f_n$ denote the ERM procedure $\tilde f_n^{(ERM)}$. We have, for any $2(\log M)/n < u < 1$,



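$$E\big[A(\tilde f_n) - A_{\mathcal F_0}\big] \le E\big[\tilde A(\hat w) - A_{\mathcal F_0}\big] \le 2u + 2\int_{u/2}^{1} \big[T_1(\epsilon) + M(T_2(\epsilon) + T_3(\epsilon))\big]\,d\epsilon, \tag{19}$$

where

$$T_1(\epsilon) = \exp\Big(-\frac{n(\epsilon - (\log M)/n)^2}{2c(A_{\mathcal F_0} - A^*)^{1/\kappa} + (2K/3)(\epsilon - (\log M)/n)}\Big),$$

$$T_2(\epsilon) = \Big(1 + \frac{16c(A_{\mathcal F_0} - A^* + 2\epsilon)^{1/\kappa}}{n\epsilon^2}\Big) \exp\Big(-\frac{n\epsilon^2}{16c(A_{\mathcal F_0} - A^* + 2\epsilon)^{1/\kappa}}\Big)$$

and

$$T_3(\epsilon) = \Big(1 + \frac{8K}{3n\epsilon}\Big) \exp\Big(-\frac{3n\epsilon}{8K}\Big).$$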

We recall that $\beta_1$ is defined in (7). Consider separately the following cases (C1) and (C2).

(C1) The case $A_{\mathcal F_0} - A^* \ge \big((\log M)/(\beta_1 n)\big)^{\kappa/(2\kappa - 1)}$.

Denote by $\mu(M)$ the unique solution of $\mu_0 = 3M\exp(-\mu_0)$. Then, clearly $(\log M)/2 \le \mu(M) \le \log M$. Take $u$ such that

$$(n\beta_1 u^2)/(A_{\mathcal F_0} - A^*)^{1/\kappa} = \mu(M).$$
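The quantities $\mu(M)$ and $u$ above have no closed form but are easy to compute. Below is a minimal numerical sketch, assuming $M \ge 1$; the function names `mu` and `u_case1` and the parameter names `beta1`, `excess` (standing for $A_{\mathcal F_0} - A^*$) and `kappa` are hypothetical, and the value of $\beta_1$ is simply passed in since its definition lies outside this section.

```python
import math


def mu(M):
    """Solve m * exp(m) = 3 * M for m > 0 by bisection (assumes M >= 1)."""
    # m -> m * exp(m) is increasing; at m = log(3M) it equals 3M * log(3M) >= 3M.
    lo, hi = 0.0, math.log(3.0 * M)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < 3.0 * M:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def u_case1(M, n, beta1, excess, kappa):
    """Return u > 0 with n * beta1 * u**2 / excess**(1/kappa) = mu(M)."""
    return math.sqrt(mu(M) * excess ** (1.0 / kappa) / (n * beta1))
```

For instance, `mu(100)` can be compared with `math.log(100) / 2` and `math.log(100)`. Because $m \mapsto m e^{m}$ is increasing, the upper bound $\mu(M) \le \log M$ is equivalent to $\log M \ge 3$, so the sketch should only be expected to reproduce it for $M$ in that range.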

Using the definition of case (C1) and of $\mu(M)$ we get $u \le A_{\mathcal F_0} - A^*$. Moreover, $u \ge 4\log M/n$, then

u/2 1K} ~ Ju/2 V (2c + ^/6)(^ -A*)V«; 

ex P - 77— TTT^TuZ de - 



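Using Lemma 2 and the inequality $u \le A_{\mathcal F_0} - A^*$, we obtain

$$\int_{u/2}^{1} T_1(\epsilon)\,d\epsilon \le \frac{8(4c + K/3)(A_{\mathcal F_0} - A^*)^{1/\kappa}}{nu} \exp\Big(-\frac{nu^2}{8(4c + K/3)(A_{\mathcal F_0} - A^*)^{1/\kappa}}\Big). \tag{20}$$

We have $16c(A_{\mathcal F_0} - A^* + 2u) \le nu^2$ thus, using Lemma 2 we get

$$\begin{aligned}
\int_{u/2}^{1} T_2(\epsilon)\,d\epsilon &\le 2\int_{u/2}^{(A_{\mathcal F_0} - A^*)/2} \exp\Big(-\frac{n\epsilon^2}{64c(A_{\mathcal F_0} - A^*)^{1/\kappa}}\Big)\,d\epsilon + 2\int_{(A_{\mathcal F_0} - A^*)/2}^{1} \exp\Big(-\frac{n\epsilon^{2 - 1/\kappa}}{128c}\Big)\,d\epsilon \\
&\le \frac{2148c(A_{\mathcal F_0} - A^*)^{1/\kappa}}{nu} \exp\Big(-\frac{nu^2}{2148c(A_{\mathcal F_0} - A^*)^{1/\kappa}}\Big).
\end{aligned} \tag{21}$$

We have $16(3n)^{-1} \le u \le A_{\mathcal F_0} - A^*$, thus,

$$\int_{u/2}^{1} T_3(\epsilon)\,d\epsilon \le \frac{16K(A_{\mathcal F_0} - A^*)^{1/\kappa}}{3nu} \exp\Big(-\frac{3nu^2}{16K(A_{\mathcal F_0} - A^*)^{1/\kappa}}\Big). \tag{22}$$

From (20), (21), (22) and (19) we obtain

$$E\big[A(\tilde f_n) - A_{\mathcal F_0}\big] \le 2u + 6M\,\frac{(A_{\mathcal F_0} - A^*)^{1/\kappa}}{n\beta_1 u} \exp\Big(-\frac{n\beta_1 u^2}{(A_{\mathcal F_0} - A^*)^{1/\kappa}}\Big).$$

The definition of $u$ leads to

$$E\big[A(\tilde f_n) - A_{\mathcal F_0}\big] \le 4\sqrt{\frac{(A_{\mathcal F_0} - A^*)^{1/\kappa}\log M}{n\beta_1}}.$$

(C2) The case $A_{\mathcal F_0} - A^* \le \big((\log M)/(\beta_1 n)\big)^{\kappa/(2\kappa - 1)}$.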

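We now choose $u$ such that $n\beta_2 u^{(2\kappa - 1)/\kappa} = \mu(M)$, where $\mu(M)$ denotes the unique solution of $\mu = 3M\exp(-\mu)$ and $\beta_2$ is defined in (8). Using the definition of case (C2) and of $\mu(M)$ we get $u \ge A_{\mathcal F_0} - A^*$ (since $\beta_1 \ge 2\beta_2$). Using the fact that $u > 4\log M/n$ and Lemma 2 we have

$$\int_{u/2}^{1} T_1(\epsilon)\,d\epsilon \le \frac{2(16c + K/3)}{nu^{1 - 1/\kappa}} \exp\Big(-\frac{nu^{2 - 1/\kappa}}{2(16c + K/3)}\Big). \tag{23}$$

We have $u \ge (128c/n)^{\kappa/(2\kappa - 1)}$ and using Lemma 2 we obtain

$$\int_{u/2}^{1} T_2(\epsilon)\,d\epsilon \le \frac{256c}{nu^{1 - 1/\kappa}} \exp\Big(-\frac{nu^{2 - 1/\kappa}}{256c}\Big). \tag{24}$$

Since $u > 16K/(3n)$ we have

$$\int_{u/2}^{1} T_3(\epsilon)\,d\epsilon \le \frac{16K}{3nu^{1 - 1/\kappa}} \exp\Big(-\frac{3nu^{2 - 1/\kappa}}{16K}\Big). \tag{25}$$

From (23), (24), (25) and (19) we obtain

$$E\big[A(\tilde f_n) - A_{\mathcal F_0}\big] \le 2u + 6M\,\frac{\exp\big(-n\beta_2 u^{(2\kappa - 1)/\kappa}\big)}{n\beta_2 u^{1 - 1/\kappa}}.$$

The definition of $u$ yields

$$E\big[A(\tilde f_n) - A_{\mathcal F_0}\big] \le 4\Big(\frac{\log M}{n\beta_2}\Big)^{\kappa/(2\kappa - 1)}.$$

This completes the proof.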



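Lemma 1. Consider the framework introduced in the beginning of Subsection 2.1. Let $\mathcal F_0 = \{f_1, \dots, f_M\}$ be a finite subset of $\mathcal F$. We assume that $\pi$ satisfies $\mathrm{MA}(\kappa, c, \mathcal F_0)$, for some $\kappa \ge 1$, $c > 0$ and $|Q(Z, f) - Q(Z, f^*)| \le K$ a.s., for any $f \in \mathcal F_0$, where $K \ge 1$ is a constant. We have for any positive numbers $t$, $x$ and any integer $n$

$$\begin{aligned}
&P\Big[\max_{f \in \mathcal F_0} \frac{A(f) - A_n(f) - (A(f^*) - A_n(f^*))}{A(f) - A^* + x} > t\Big] \\
&\quad \le M\Big[\Big(1 + \frac{4c x^{1/\kappa}}{n(tx)^2}\Big) \exp\Big(-\frac{n(tx)^2}{4c x^{1/\kappa}}\Big) + \Big(1 + \frac{4K}{3ntx}\Big) \exp\Big(-\frac{3ntx}{4K}\Big)\Big].
\end{aligned}$$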
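Proof. We use a "peeling device". Let $x > 0$. For any integer $j$, we consider

$$\mathcal F_j = \{f \in \mathcal F_0 : jx \le A(f) - A^* < (j + 1)x\}.$$

Define the empirical process

$$Z_x(f) = \frac{A(f) - A_n(f) - (A(f^*) - A_n(f^*))}{A(f) - A^* + x}.$$

Using Bernstein's inequality and margin assumption $\mathrm{MA}(\kappa, c, \mathcal F_0)$ to upper bound the variance term, we have

$$\begin{aligned}
P\Big[\max_{f \in \mathcal F_0} Z_x(f) > t\Big] &\le \sum_{j=0}^{+\infty} P\Big[\max_{f \in \mathcal F_j} Z_x(f) > t\Big] \\
&\le \sum_{j=0}^{+\infty} P\Big[\max_{f \in \mathcal F_j} \big(A(f) - A_n(f) - (A(f^*) - A_n(f^*))\big) > t(j + 1)x\Big] \\
&\le M\sum_{j=0}^{+\infty} \exp\Big(-\frac{n[t(j + 1)x]^2}{2c((j + 1)x)^{1/\kappa} + (2K/3)t(j + 1)x}\Big) \\
&\le M\sum_{j=0}^{+\infty} \Big[\exp\Big(-\frac{n(tx)^2 (j + 1)^{2 - 1/\kappa}}{4c x^{1/\kappa}}\Big) + \exp\Big(-(j + 1)\frac{3ntx}{4K}\Big)\Big] \\
&\le M\Big[\exp\Big(-\frac{n(tx)^2}{4c x^{1/\kappa}}\Big) + \exp\Big(-\frac{3ntx}{4K}\Big)\Big] \\
&\qquad + M\int_{1}^{+\infty} \Big[\exp\Big(-\frac{n(tx)^2}{4c x^{1/\kappa}} u^{2 - 1/\kappa}\Big) + \exp\Big(-\frac{3ntx}{4K} u\Big)\Big]\,du.
\end{aligned}$$

Lemma 2 completes the proof.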

Lemma 2. Let $\alpha \ge 1$ and $a, b > 0$. An integration by parts yields

$$\int_{a}^{+\infty} \exp(-b t^{\alpha})\,dt \le \frac{\exp(-b a^{\alpha})}{\alpha b a^{\alpha - 1}}.$$
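The bound of Lemma 2 can be compared numerically with the integral it controls. The sketch below is only an illustration: the function names `tail_integral` and `lemma2_bound` are hypothetical, the integral is approximated by a midpoint rule, and the infinite range is truncated at a point beyond which the integrand is negligible.

```python
import math


def tail_integral(a, b, alpha, steps=100_000):
    """Midpoint-rule approximation of the integral of exp(-b * t**alpha) on [a, T].

    T is chosen so that b * T**alpha = b * a**alpha + 50, which makes the
    neglected part of the tail smaller than exp(-50) times the leading term.
    """
    T = ((b * a ** alpha + 50.0) / b) ** (1.0 / alpha)
    h = (T - a) / steps
    return h * sum(math.exp(-b * (a + (i + 0.5) * h) ** alpha) for i in range(steps))


def lemma2_bound(a, b, alpha):
    """Right-hand side of Lemma 2: exp(-b a^alpha) / (alpha b a^(alpha - 1))."""
    return math.exp(-b * a ** alpha) / (alpha * b * a ** (alpha - 1.0))
```

For $\alpha = 1$ the two sides coincide exactly (both equal $e^{-ba}/b$), which gives a convenient sanity check; for $\alpha > 1$ the bound is strict.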

Proof of Corollaries 1 and 2. In the bounded regression setup, any probability distribution $\pi$ on $\mathcal X \times [0, 1]$ satisfies the margin assumption $\mathrm{MA}(1, 16, \mathcal F_1)$, where $\mathcal F_1$ is the set of all measurable functions from $\mathcal X$ to $[0, 1]$. In density estimation with the integrated squared risk, any probability measure $\pi$ on $(\mathcal Z, \mathcal T)$, absolutely continuous w.r.t. the measure $\mu$ with one version of its density a.s. bounded by a constant $B \ge 1$, satisfies the margin assumption $\mathrm{MA}(1, 16B^2, \mathcal F_B)$ where $\mathcal F_B$ is the set of all non-negative functions $f \in L^2(\mathcal Z, \mathcal T, \mu)$ bounded by $B$. To complete the proof we use that for any $\epsilon > 0$,

$$\Big(\frac{B(\mathcal F_0, \pi, Q)\log M}{\beta_1 n}\Big)^{1/2} \le \epsilon B(\mathcal F_0, \pi, Q) + \frac{\log M}{\beta_2 n \epsilon}$$

and in both cases $f \mapsto Q(z, f)$ is convex for any $z \in \mathcal Z$.

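Proof of Theorem 3. We apply Theorem 2 with $\epsilon = 1$, to the multi-thresholding estimator $\tilde f_n$ defined in (15). Since the density function $f^*$ to estimate takes its values in $[0, B]$, $\mathrm{Card}(\Lambda_n) = \log n$ and $m \ge n/2$, we have, conditionally to the first subsample $D_m$,

$$\begin{aligned}
E\big[\|f^* - \tilde f_n\|^2_{L^2([0,1])} \,\big|\, D_m\big] &\le 2\min_{u \in \Lambda_n} \big\|f^* - h_{0,B}(\hat f_{v_u}(D_m, \cdot))\big\|^2_{L^2([0,1])} + \frac{4(\log n)\log(\log n)}{\beta_2 n} \\
&\le 2\min_{u \in \Lambda_n} \big\|f^* - \hat f_{v_u}(D_m, \cdot)\big\|^2_{L^2([0,1])} + \frac{4(\log n)\log(\log n)}{\beta_2 n},
\end{aligned}$$

where $h_{0,B}$ is the projection function introduced in (14) and $\beta_2$ is given in (8). Now, for any $s > 0$, let us consider $j_s$ an integer in $\Lambda_n$ such that $n^{1/(1+2s)} \le 2^{j_s} < 2n^{1/(1+2s)}$. Since the estimators $\hat\alpha_{j,k}$ and $\hat\beta_{j,k}$ defined by (16) satisfy the inequalities (12) and (13), Theorem 2 implies that, for any $p \in [1, \infty]$, $s \in (1/p, N]$, $q \in [1, \infty]$ and $n$ large enough, we have

$$\begin{aligned}
\sup_{f^* \in B^s_{p,q}(L)} E\big[\|\tilde f_n - f^*\|^2_{L^2([0,1])}\big] &= \sup_{f^* \in B^s_{p,q}(L)} E\big[E\big[\|\tilde f_n - f^*\|^2_{L^2([0,1])} \,\big|\, D_m\big]\big] \\
&\le 2\sup_{f^* \in B^s_{p,q}(L)} E\Big[\min_{u \in \Lambda_n} \big\|f^* - \hat f_{v_u}(D_m, \cdot)\big\|^2_{L^2([0,1])}\Big] + \frac{4(\log n)\log(\log n)}{\beta_2 n} \\
&\le 2\sup_{f^* \in B^s_{p,q}(L)} E\Big[\big\|f^* - \hat f_{v_{j_s}}(D_m, \cdot)\big\|^2_{L^2([0,1])}\Big] + \frac{4(\log n)\log(\log n)}{\beta_2 n}.
\end{aligned}$$

This completes the proof of Theorem 3.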

Proof of Theorem 4. The proof of Theorem 4 is similar to the proof of Theorem 3. We only need to prove that, for any $j \in \{\tau, \dots, j_1\}$ and $k \in \{0, \dots, 2^j - 1\}$, the estimators $\hat\alpha_{j,k}$ and $\hat\beta_{j,k}$ defined by (17) satisfy the inequalities (12) and (13). First of all, let us notice that the random variables $Y_1\psi_{j,k}(X_1), \dots, Y_n\psi_{j,k}(X_n)$ are i.i.d. and that their $m$-th moment, for $m \ge 2$, satisfies

E(fe(Xi)n < Hviir^'^-^Ed^^i)! 2 ) = IHir 2 2 J ' (m/2 - 1} - 

For the first inequality (cf. inequality (12)), Rosenthal's inequality (see
P-241]) yields, for any $j \in \{\tau, \dots, j_1\}$,



< C\\Y\amUn- 3 2^ +n-*) <Cn-\ 



For the second inequality (cf. inequality (13)), Bernstein's inequality yields



( 2v ^|/3, fc - M > p^) < 2 exp ( - 8ff2 + m P J pVE/{2V ^ ) > 



where a G {r, ji}, p G (0, oo), 



m = ||y^W-/3i, fc ||oo<2^ 2 ||r|| oo ||^|| oo + ||ri|| 2([0il]) 

< 2^ 2 (|H| 0O + 1) < 2 1 / 2 (n/logn) 1 / 2 (||^|| 00 + 1), 



and 



a 2 = Edy^-kpfi) - /3 iifc | 2 ) < Edy^fe^i)! 2 ) < II^IIL < i- 



Since a < logn, we complete the proof by seeing that for p large enough, we have 